Hash Tables

Zhao Cong

Direct-address tables

  • An array, or direct-address table, denoted by T[0..m−1], in which each position, or slot, corresponds to a key in the universe U.
  • If the set contains no element with key k, then T[k] = NIL.
  • Direct addressing is a simple technique that works well when the universe U of keys is reasonably small.
    DIRECT-ADDRESS-SEARCH(T, k)
      return T[k]
    DIRECT-ADDRESS-INSERT(T, x)
      T[x.key] = x
    DIRECT-ADDRESS-DELETE(T, x)
      T[x.key] = NIL
    Each of these operations takes only O(1) time.
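The three operations above can be sketched in Python; this is a minimal illustration, assuming keys are integers in {0, …, m−1} and elements are objects with a `key` attribute (the `Element` class here is a hypothetical stand-in):

```python
class Element:
    """A stored element: a key plus satellite data."""
    def __init__(self, key, value):
        self.key = key
        self.value = value

class DirectAddressTable:
    def __init__(self, m):
        self.slots = [None] * m   # one slot per key in the universe U

    def search(self, k):          # DIRECT-ADDRESS-SEARCH: return T[k]
        return self.slots[k]

    def insert(self, x):          # DIRECT-ADDRESS-INSERT: T[x.key] = x
        self.slots[x.key] = x

    def delete(self, x):          # DIRECT-ADDRESS-DELETE: T[x.key] = NIL
        self.slots[x.key] = None
```

Each method is a single array access, which is why all three operations run in O(1) worst-case time.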

Hash tables

  • Specifically, we can reduce the storage requirement to Θ(|K|), where K is the set of keys actually stored, while we maintain the benefit that searching for an element in the hash table still requires only O(1) time. The catch is that this bound is for the average-case time, whereas for direct addressing it holds for the worst-case time.
  • We use a hash function h to compute the slot from the key k. Here, h maps the universe U of keys into the slots of a hash table T[0..m−1]: h : U → {0, 1, …, m−1}.
  • We say that an element with key k hashes to slot h(k); we also say that h(k) is the hash value of key k.
  • There is one hitch: two keys may hash to the same slot. We call this situation a collision.
  • Of course, the ideal solution would be to avoid collisions altogether. We might try to achieve this goal by choosing a suitable hash function h. One idea is to make h appear to be “random,” thus avoiding collisions or at least minimizing their number.
  • Because |U| > m, however, there must be at least two keys that have the same hash value; avoiding collisions altogether is therefore impossible.

Collision resolution by chaining

  • In chaining, we place all the elements that hash to the same slot into the same linked list.
CHAINED-HASH-INSERT(T, x)
  insert x at the head of list T[h(x.key)]
CHAINED-HASH-SEARCH(T, k)
  search for an element with key k in list T[h(k)]
CHAINED-HASH-DELETE(T, x)
  delete x from the list T[h(x.key)]
  • We can delete an element in O(1) time if the lists are doubly linked: given a pointer to x, we can splice it out without searching its list. If the hash table supports deletion, then its linked lists should be doubly linked so that we can delete an item quickly.
  • With singly linked lists, deleting x would first require finding x's predecessor in its list, so deletion and searching would have the same asymptotic running times.
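The three chained-hash operations can be sketched as follows; this is a simplified illustration in which a Python list stands in for each linked list (so `delete` is not O(1) as a true doubly linked list would allow), and the hash function h(k) = k mod m is an assumed placeholder:

```python
class Item:
    """A stored element: a key plus satellite data."""
    def __init__(self, key, value):
        self.key = key
        self.value = value

class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.table = [[] for _ in range(m)]   # one chain per slot

    def _h(self, k):
        return k % self.m                     # placeholder hash function

    def insert(self, x):
        # CHAINED-HASH-INSERT: place x at the head of its chain
        self.table[self._h(x.key)].insert(0, x)

    def search(self, k):
        # CHAINED-HASH-SEARCH: scan the chain for key k
        for x in self.table[self._h(k)]:
            if x.key == k:
                return x
        return None

    def delete(self, x):
        # CHAINED-HASH-DELETE: remove x from its chain
        self.table[self._h(x.key)].remove(x)
```

Note that two keys with k₁ mod m == k₂ mod m land in the same chain, and search simply walks that chain.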

Analysis of hashing with chaining

  • Given a hash table T with m slots that stores n elements, we define the load factor α for T as n/m, that is, the average number of elements stored in a chain.
  • The worst-case behavior of hashing with chaining is terrible: all n keys hash to the same slot, creating a list of length n, so searching takes Θ(n) time, no better than if we used one linked list for all the elements.
  • We shall assume that any given element is equally likely to hash into any of the m slots, independently of where any other element has hashed to. We call this the assumption of simple uniform hashing.
  • For j = 0, 1, …, m−1, let us denote the length of the list T[j] by n_j, so that n = n_0 + n_1 + … + n_{m−1}; and the expected value of n_j is E[n_j] = α = n/m.
  • Theorem 11.1: In a hash table in which collisions are resolved by chaining, an unsuccessful search takes average-case time Θ(1 + α), under the assumption of simple uniform hashing.
  • Theorem 11.2: In a hash table in which collisions are resolved by chaining, a successful search takes average-case time Θ(1 + α), under the assumption of simple uniform hashing.

Proof: The number of elements examined during a successful search for an element x is one more than the number of elements that appear before x in x's list. Because new elements are placed at the front of the list, the elements before x in the list were all inserted after x was inserted.

Let x_i denote the i-th element inserted into the table, for i = 1, 2, …, n, and let k_i = x_i.key. For keys k_i and k_j, we define the indicator random variable X_ij = I{h(k_i) = h(k_j)}. Under the assumption of simple uniform hashing, we have Pr{h(k_i) = h(k_j)} = 1/m, and so by Lemma 5.1, E[X_ij] = 1/m. Thus, the expected number of elements examined in a successful search is:

E[ (1/n) Σ_{i=1}^{n} ( 1 + Σ_{j=i+1}^{n} X_ij ) ] = 1 + (1/n) Σ_{i=1}^{n} Σ_{j=i+1}^{n} (1/m) = 1 + (n−1)/(2m) = 1 + α/2 − α/(2n).

Thus, the total time required for a successful search (including the time for computing the hash function) is Θ(2 + α/2 − α/(2n)) = Θ(1 + α).

Hash functions

What makes a good hash function?

  • A good hash function satisfies (approximately) the assumption of simple uniform hashing.
  • A good approach derives the hash value in a way that we expect to be independent of any patterns that might exist in the data.
  • We note that some applications of hash functions might require stronger properties than are provided by simple uniform hashing.

Interpreting keys as natural numbers

  • Most hash functions assume that the universe of keys is the set N = {0, 1, 2, …} of natural numbers. Thus, if the keys are not natural numbers, we find a way to interpret them as natural numbers; for example, a character string can be interpreted as an integer expressed in suitable radix notation.

The division method

  • The hash function is h(k) = k mod m, which maps a key k into one of m slots by taking the remainder of k divided by m.
  • A prime not too close to an exact power of 2 is often a good choice for m.
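A minimal sketch of the division method, using m = 701 as an example modulus (701 is prime and not close to a power of 2, since 2⁹ = 512 and 2¹⁰ = 1024):

```python
def h_division(k, m=701):
    """Division-method hash: h(k) = k mod m, for integer keys k."""
    return k % m
```

For example, h_division(1000) computes 1000 mod 701 = 299.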

The multiplication method

  • The multiplication method for creating hash functions operates in two steps.
    • First, we multiply the key k by a constant A in the range 0 < A < 1 and extract the fractional part of kA.
    • Then, we multiply this value by m and take the floor of the result. In short, the hash function is h(k) = ⌊m (kA mod 1)⌋.
  • An advantage of the multiplication method is that the value of m is not critical; we typically choose it to be a power of 2.
  • The optimal choice of A depends on the characteristics of the data being hashed. Knuth suggests that A ≈ (√5 − 1)/2 = 0.6180339887… is likely to work reasonably well.
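The two steps above can be sketched directly; this version uses floating-point arithmetic for clarity (a production implementation would use the word-level fixed-point trick instead), with Knuth's suggested constant:

```python
import math

A = (math.sqrt(5) - 1) / 2   # Knuth's suggestion, approx. 0.6180339887

def h_mult(k, m):
    """Multiplication-method hash: h(k) = floor(m * (k*A mod 1))."""
    frac = (k * A) % 1.0      # step 1: fractional part of k*A
    return math.floor(m * frac)  # step 2: scale by m and take the floor
```

Because only the fractional part of kA matters, the result always lands in {0, …, m−1} regardless of how large k is, which is why the choice of m is not critical.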

Universal hashing

  • Choose the hash function randomly in a way that is independent of the keys that are actually going to be stored. This approach is called universal hashing.
  • Let H be a finite collection of hash functions that map a given universe U of keys into the range {0, 1, …, m−1}. Such a collection is said to be universal if for each pair of distinct keys k, l ∈ U, the number of hash functions h ∈ H for which h(k) = h(l) is at most |H|/m.
  • Theorem 11.3: Suppose that a hash function h is chosen randomly from a universal collection of hash functions and has been used to hash n keys into a table T of size m, using chaining to resolve collisions. If key k is not in the table, then the expected length E[n_{h(k)}] of the list that key k hashes to is at most the load factor α = n/m. If key k is in the table, then the expected length E[n_{h(k)}] of the list containing key k is at most 1 + α.
  • Corollary 11.4: Using universal hashing and collision resolution by chaining in an initially empty table with m slots, it takes expected time Θ(n) to handle any sequence of n INSERT, SEARCH, and DELETE operations containing O(m) INSERT operations.

Designing a universal class of hash functions

Choose a prime number p large enough so that every possible key k is in the range 0 to p − 1. For a ∈ {1, 2, …, p−1} and b ∈ {0, 1, …, p−1}, define h_{a,b}(k) = ((a·k + b) mod p) mod m. The family of all such hash functions is H_{p,m} = { h_{a,b} : a ∈ {1, …, p−1}, b ∈ {0, …, p−1} }. Since we have p − 1 choices for a and p choices for b, the collection H_{p,m} contains p(p − 1) hash functions.

  • Theorem 11.5: The class H_{p,m} of hash functions is universal.
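The construction of H_{p,m} can be sketched as follows; the choice p = 2³¹ − 1 (a Mersenne prime) is an assumption that covers any key below 2³¹ − 1:

```python
import random

P = 2**31 - 1   # assumed prime, larger than every key we intend to hash

def random_universal_hash(m, rng=random):
    """Draw h_{a,b}(k) = ((a*k + b) mod p) mod m uniformly from H_{p,m}."""
    a = rng.randrange(1, P)   # a in {1, ..., p-1}
    b = rng.randrange(0, P)   # b in {0, ..., p-1}
    return lambda k: ((a * k + b) % P) % m
```

Once drawn, a single h_{a,b} is deterministic; the randomness lies only in which member of the family we picked, which is what makes the adversary unable to force collisions.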

Open addressing

  • In open addressing, all elements occupy the hash table itself. That is, each table entry contains either an element of the dynamic set or NIL.
HASH-INSERT(T, k)
  i = 0
  repeat
    j = h(k, i)
    if T[j] == NIL
      T[j] = k
      return j
    else i = i + 1
  until i == m
  error "hash table overflow"
HASH-SEARCH(T, k)
  i = 0
  repeat
    j = h(k, i)
    if T[j] == k
      return j
    else i = i + 1
  until T[j] == NIL or i == m
  return NIL
  • Deletion from an open-address hash table is difficult. When we delete a key from slot j, we cannot simply mark the slot as empty by storing NIL in it, because doing so might make it impossible to retrieve a key whose probe sequence passed through slot j during insertion. We can solve this problem by marking the slot, storing in it the special value DELETED instead of NIL. We would then modify the procedure HASH-INSERT to treat such a slot as if it were empty, so that we can insert a new key there. We do not need to modify HASH-SEARCH, since it will pass over DELETED values while searching.
  • We assume uniform hashing: the probe sequence ⟨h(k, 0), h(k, 1), …, h(k, m−1)⟩ of each key is equally likely to be any of the m! permutations of ⟨0, 1, …, m−1⟩.
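The pseudocode above, together with the DELETED convention, can be sketched in Python; h(k, i) = (k mod m + i) mod m (linear probing) is an assumed probe function for illustration:

```python
NIL = None
DELETED = object()   # sentinel distinct from every key and from NIL

class OpenAddressTable:
    def __init__(self, m):
        self.m = m
        self.T = [NIL] * m

    def _h(self, k, i):
        return (k % self.m + i) % self.m   # assumed linear-probing sequence

    def insert(self, k):
        # HASH-INSERT, modified to treat a DELETED slot as empty
        for i in range(self.m):
            j = self._h(k, i)
            if self.T[j] is NIL or self.T[j] is DELETED:
                self.T[j] = k
                return j
        raise OverflowError("hash table overflow")

    def search(self, k):
        # HASH-SEARCH, unmodified: passes over DELETED, stops only at NIL
        for i in range(self.m):
            j = self._h(k, i)
            if self.T[j] == k:
                return j
            if self.T[j] is NIL:
                return None
        return None

    def delete(self, k):
        j = self.search(k)
        if j is not None:
            self.T[j] = DELETED
```

Because search stops only at NIL, a key inserted past a now-deleted slot remains reachable after the deletion.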

Techniques to compute the probe sequences

Linear probing

Given an ordinary hash function h′ : U → {0, 1, …, m−1}, which we refer to as an auxiliary hash function, the method of linear probing uses the hash function h(k, i) = (h′(k) + i) mod m, for i = 0, 1, …, m−1. Because the initial probe determines the entire probe sequence, there are only m distinct probe sequences.

Linear probing is easy to implement, but it suffers from a problem known as primary clustering: long runs of occupied slots build up, increasing the average search time.

Quadratic probing

Quadratic probing uses a hash function of the form h(k, i) = (h′(k) + c₁·i + c₂·i²) mod m, where h′ is an auxiliary hash function and c₁, c₂ are positive auxiliary constants. This method leads to a milder form of clustering, called secondary clustering. As in linear probing, the initial probe determines the entire sequence, and so only m distinct probe sequences are used.

Double hashing

Double hashing uses a hash function of the form h(k, i) = (h₁(k) + i·h₂(k)) mod m, where h₁ and h₂ are auxiliary hash functions. The initial probe goes to position T[h₁(k)]; successive probe positions are offset from previous positions by the amount h₂(k), modulo m. Each distinct pair (h₁(k), h₂(k)) yields a distinct probe sequence, so double hashing uses Θ(m²) probe sequences, an improvement over linear or quadratic probing.
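A double-hashing probe sequence can be sketched as follows; h₁ and h₂ here are assumed toy auxiliary functions, with h₂(k) forced to be odd so that it is relatively prime to a power-of-two m, which guarantees the sequence visits every slot:

```python
def probe_sequence(k, m):
    """All m probe positions for key k under double hashing.
    Assumes m is a power of two, so an odd h2 is relatively prime to m."""
    h1 = k % m                     # assumed auxiliary hash h1(k)
    h2 = 1 + 2 * (k % (m // 2))    # assumed auxiliary hash h2(k), always odd
    return [(h1 + i * h2) % m for i in range(m)]
```

Because gcd(h₂(k), m) = 1, the m probes form a permutation of ⟨0, 1, …, m−1⟩, so an insertion never gets stuck while an empty slot remains.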

Analysis of open-address hashing

Perfect hashing